verkko-fillet : post verkko graph and assembly cleaning in Python¶

verkko-fillet is an easy-to-use toolkit for cleaning Verkko assemblies. It includes tools for generating the Verkko-Fillet object, performing assembly quality checks, identifying gaps, assigning chromosomes, searching for ONT reads to help resolve gaps, filling gaps, and generating the final Rukki path (in a GAF-like format) for future Verkko CNS runs.

This Python-based implementation streamlines the entire process, starting right after the Verkko assembly is completed and preparing for the CNS run.

In [1]:
%load_ext autoreload
%autoreload 2
In [23]:
import sys 
import importlib
import pandas as pd
import time
import os
from IPython.display import Image, display
pd.set_option('mode.chained_assignment', None)
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)
import warnings
import session_info
# Suppress FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)
In [172]:
sys.path.append("/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/src/")

# import verkkoFillet as vf
import verkkoFillet as vf
importlib.reload(vf)
Out[172]:
<module 'verkkoFillet' from '/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/src/verkkoFillet/__init__.py'>

The verkkoFillet module, abbreviated as vf, consists of three sub-modules: pp, tl, and pl.

  • vf.tl: Provides tools for running shell scripts.
  • vf.pp: Includes modules for preprocessing and modifying datasets.
  • vf.pl: Dedicated to plotting and visualization.

Loading verkko directory and building the verkko-fillet object¶

The input file is very simple. You only need the Verkko output directory, as it is well-structured.

In [181]:
verkkoDir="/data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/"
os.chdir(verkkoDir)
In [182]:
obj = vf.pp.read_Verkko(verkkoDir)
Loading paths.tsv file...
Loading Ont alignment GAF file...
In [183]:
obj
Out[183]:
verkkoDir: /data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/
paths: name, path, assignment
gaf: Qname, path, mapQ, identity, path_modi
In [184]:
image_path = "/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/data/test_giraffe/fig/giraffe_complete_verkko-verkko_filletObj.png"
display(Image(filename=image_path,width=700))
No description has been provided for this image

The Verkko Fillet object will read all files in the Verkko output directory, including the final Rukki path file, ONT alignment, and others, if they exist. However, don't worry if your Verkko directory does not include these files. In this Jupyter notebook, we will generate all the necessary files and statistics and load them into the Verkko Fillet object later. Once you build your own Verkko Fillet object, you can save it locally as a Python pickle file, allowing you to load it anytime for future curation tasks.

You can use the Verkko Fillet Python module to view and edit the object using specific functions. Additionally, you can easily access each piece of data stored in the object through its attributes (e.g., obj.verkkoDir).

When you print the object, it will display the types of data stored within it.

Getting the t2t stats¶

Before starting curation and polishing the assembly, it is helpful to examine the statistics and quality of the initial assembly.

Firstly, run the vf.pp.getT2T function to retrieve the T2T statistics, including the number of scaffolds, contigs, telomeres, and gaps. This function generates four files in the Verkko output directory using the assembly.fasta file:

  • assembly.telomere.bed
  • assembly.gaps.bed
  • assembly.t2t_scfs
  • assembly.t2t_ctgs
In [8]:
%%time
vf.tl.getT2T(obj)
getT2T was done!
CPU times: user 1.07 ms, sys: 211 ms, total: 212 ms
Wall time: 6.16 s

Collapsing rDNA nodes.¶

As default, vf.tl.rmrDNA function uses homopolymer-compressed rDNA morph to screen and collapse in the Verkko graph, along with the assembly.telomere.bed file generated above using vf.tl.getT2T(obj). This process marks telomere nodes at the end of the contigs and generates three output files:

  1. target.screennodes.out: This file includes the nodes that have rDNA sequences.
  2. assembly.homopolymer-compressed.noseq.telo_rdna.gfa: The graph with the rDNA collapsed and telomere nodes added.
  3. assembly.colors.telo_rdna.csv: Adds telomere nodes marked in green (#008000) and rDNA nodes marked in purple (#A020F0), in addition to the original assembly.colors.csv.
In [9]:
vf.tl.rmrDNA(obj)
Starting removing rDNA nodes in the graph
remove rDNA was done!
Output files: 
target.screennodes.out
assembly.homopolymer-compressed.noseq.telo_rdna.gfa
assembly.colors.telo_rdna.csv

QV calculation¶

QV (Quality Value) is fundamental metric for assessing the quality of an assembly.

The input for QV calculation is a list of high-quality sequencing data, such as HiFi reads, used for the Verkko assembly. K-mers are extracted separately from the sequencing data and the reference assembly. The number of shared k-mers between the two is then used to calculate the QV.

  • A high QV indicates a high-quality assembly, where most k-mers are shared with the original high-quality sequencing reads.
  • A low QV suggests that the assembly contains many errors not present in the high-quality sequencing data.
In [10]:
fofn = "/data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.1.1/10-evaluation/kmers/fofn/child.fofn"

the kmer databases for each data and assembly will generatedd by meryl

In [1]:
%%time
vf.tl.makeMeryl(obj, fofn = fofn, mergury_dir= "/gpfs/gsfs11/users/kimj75/program/merqury-1.3")
CPU times: user 1 µs, sys: 0 ns, total: 1 µs
Wall time: 3.1 µs
In [1]:
%%time
vf.tl.calQV(obj, mergury_dir= "/gpfs/gsfs11/users/kimj75/program/merqury-1.3")
CPU times: user 0 ns, sys: 25 µs, total: 25 µs
Wall time: 46.3 µs
In [185]:
obj = vf.pp.getQV(obj)
In [186]:
obj
Out[186]:
verkkoDir: /data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/
paths: name, path, assignment
gaf: Qname, path, mapQ, identity, path_modi
qv: asmName, nKmer_uniq_asm, nKmer_total, QV, ErrorRate
In [187]:
obj.qv
Out[187]:
asmName nKmer_uniq_asm nKmer_total QV ErrorRate
Loading ITables v2.2.4 from the init_notebook_mode cell... (need help?)

Each column shows:

  1. Assembly of interest: This is a combination of the two datasets mentioned above.
  2. K-mers uniquely found only in the assembly
  3. Total K-mers found in the assembly
  4. QV (Quality Value)
  5. Error rate

For more detailed information, please visit the Merqury GitHub.

In [13]:
vf.pl.qvPlot(obj)
/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/src/verkkoFillet/plotting/_baseQC.py:45: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45, ha='right')
No description has been provided for this image

Chromosome assignment¶

To check which chromosomes are contigs and to pair the same chromosomes from two haplotypes, chromosome assignment is crucial. The vf.tl.chrAssign function will run Minimap to align the assembly to a user-provided reference FASTA file and obtain high-confidence chromosome assignments for each contig in the assembly. Once this is done, we will load not only the chromosome assignment results but also the T2T stats generated earlier using the vf.pp.getT2T function.

The chromosome.map file has two columns:

  1. One column contains the names of contigs associated with each primary chromosome (starting with CN, CM, etc., in the reference FASTA file).
  2. The other column contains the names of the primary chromosomes (e.g., chr1, chr2, etc.).

These contig names need to be converted into easy-to-read chromosome names.

In [189]:
ref = "/vf/users/Phillippy/projects/giraffeT2T/assembly/ncbiRef/GCA_017591445.1_ASM1759144v1_genomic.fna"
mapfile="/vf/users/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/chromosome.map"
In [56]:
%%time
vf.tl.chrAssign(obj = obj, ref = ref, chr_name="CM")
get Chr name was done!
Standard Output: Using custom reference sequence /vf/users/Phillippy/projects/giraffeT2T/assembly/ncbiRef/GCA_017591445.1_ASM1759144v1_genomic.fna
113 sire_ dam_ compNC: . regNC: CM
In [190]:
obj = vf.pp.readChr(obj,mapfile)
The chromosome infomation was stored in obj.stats
In [191]:
obj
Out[191]:
verkkoDir: /data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/
stats: contig, scf_ctg, ref_chr, contig_len, ref_chr_len, hap, chr, completeness
paths: name, path, assignment
gaf: Qname, path, mapQ, identity, path_modi
qv: asmName, nKmer_uniq_asm, nKmer_total, QV, ErrorRate
In [192]:
obj.stats.head()
Out[192]:
contig scf_ctg ref_chr contig_len ref_chr_len hap chr completeness
Loading ITables v2.2.4 from the init_notebook_mode cell... (need help?)
In [197]:
tab = obj.stats.groupby(['hap','chr'])['contig'].count().reset_index()
# tab.loc[tab['contig'] > 2,:]
tab
Out[197]:
hap chr contig
Loading ITables v2.2.4 from the init_notebook_mode cell... (need help?)
In [18]:
vf.pl.completePlot(obj)
No description has been provided for this image
In [19]:
# the chromosome ordering should be fixed
In [20]:
vf.pl.contigLenPlot(obj)
No description has been provided for this image
In [21]:
vf.pl.contigPlot(obj)
No description has been provided for this image
  • Brick Red: Contig (complete T2T without gaps)
  • Salmon Color: Scaffold (T2T with gaps or tangles)
  • Baige : Not a scaffold (missing one of the telomere)

Compleasm¶

In [22]:
# to be added
In [ ]:
 

Lets fillet the verkko assembly!¶

Verkko is a very powerful tool for assembling long-read data to generate a complete diploid assembly, but it still has some limitations, such as generating gaps, tangles, missing telomeres, and more. These issues can occur due to sequencing errors, lower sample quality, or overly complex or homogeneous sequences, which are difficult to assemble. As a result, some manual inspection is needed. Below, we classify the scenarios that can lead to gaps in the assembly and their potential solutions.

1. Missing Telomeres

  • Broken Contig: This can occur when large tangles or gaps exist in the middle of a contig, causing Rukki to split the contig. This can be resolved by connecting the contigs with a gap in the Rukki path.

  • Internal Telomere Sequencing: In this case, additional sequences are added after the telomere, preventing the getT2T function from detecting the telomere and reporting the contig as a scaffold, even if it contains an internal telomere. This issue might arise due to ONT sequencing bias and can be fixed by trimming the sequences before the internal telomere after identifying its start position.

  • Missing Sequence at the End: If you see isolated telomere-attached nodes, you can find homologous nodes to determine the counterpart of the haplotype on the chromosome and stitch them together by inserting a gap.

2. Gaps, Bubbles, and Tangles

  • rDNA: The number of rDNA arrays and their morphs vary by species, and rDNA sequences are too similar to be assembled automatically. We recommend running ribotin to find the consensus of the rDNA sequences and patch the assembly with the rDNA consensus.

  • Complex Tangles: Sometimes, Hifiasm can generate longer contigs that cover tangles in the Verkko assembly. We can use the HiFi assembly to align it on the Verkko graph and get hints from the alignment to resolve the walk on the nodes.

  • Simple Tangles or Bubbles: In this scenario, ONT alignments on the Verkko graph using graphaligner can be helpful. Searching for supported paths that connect nodes in the gap can help resolve simple tangles.

  • Edge Missing: If your assembly has missing edges between two nodes, this might be due to too few supported ONT reads (<3) connecting the nodes. We can find the supported split ONT reads aligned on nearby nodes and add the edge to the graph.

  • Loops: Repeated sequences with multiple copies can be problematic to assemble, especially when the repeat is long enough to span ONT reads at both ends of the flanking nodes. In this case, we edit the gaps with the estimated number of loops and fill the gaps with semi-filled gaps.

3. Haplotype Unassignment

  • One Haplotypes is Ambiguous: This scenario occurs when heterozygous nodes flank long runs of homozygous nodes, making Verkko unsure which haplotype should be assigned. If we have evidence that the other haplotype has already been assigned to one of the nodes in the gap, we can assign the other haplotype to the ambiguous node.

  • Both Haplotypes are Ambiguous: When both haplotypes have gaps in a bubble, there’s no way to determine the correct haplotype for the gap. In this case, we will randomly assign the haplotypes. This can be corrected during the polishing step later.

In [24]:
image_path = "/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/data/test_giraffe/fig/verkko_curation_decision_tree.png"
display(Image(filename=image_path,width=1500, height = 500))
No description has been provided for this image

Missing Telomere¶

3. Internal telomere sequencing detection¶

Sometimes, additional sequences are added to the end of a contig beyond the telomere. As a result, the vf.tl.getT2T function classifies them as not scaffolds because it cannot detect telomere sequences at the end of the contig.

To check if your assembly has this kind of issue, you can use the following functions:

  • vf.tl.run_intra_telo(): Finds all telomere sequences at the whole-genome level.
  • vf.pp.find_intra_telo(): Helps identify which contigs contain internal telomere sequences.

You can trim these sequences later, after running the final Verkko consensus step.

In [25]:
# vf.tl.run_intra_telo(obj)
In [15]:
file = "/vf/users/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_v0.0.4/giraffe/assembly.windows.0.5.bed"
vf.pp.find_intra_telo(obj, file=file, loc_from_end=15000)
Out[15]:
contig ref_chr chr hap start end len_fai
Loading ITables v2.2.4 from the init_notebook_mode cell... (need help?)

The dam_compressed.k31.hapmer-0000002 contig (chr11 mat) was reported as a scaffold (see the vf.pl.contigPlot(obj) plot) because the start telomere begins inside the telomeric region (18,800 bp from the first base at one end). This caused getT2T to miss it. We will trim this contig after running Verkko-consensus to finalize the assembly and fix the issue.

Gaps / Bubbles / Tangles¶

Finding gaps from path¶

Using the Rukki final paths that are already loaded in the object, we can retrieve the gaps with flanking nodes and the names of the contigs using the vf.pp.findGaps function. This function will store the list of gaps in the obj.gaps attribute, and you can access the table.

In [22]:
%%time
obj = vf.pp.findGaps(obj)
43 gaps were found -> obj.gaps
CPU times: user 495 ms, sys: 554 ms, total: 1.05 s
Wall time: 1.05 s
In [23]:
print(obj)
verkkoDir: /data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/
stats: contig, scf_ctg, ref_chr, contig_len, ref_chr_len, hap, chr, completeness
paths: name, path, assignment, gaps
gaps: gapId, name, gaps, notes, fixedPath, startMatch, endMatch, finalGaf, done
gaf: Qname, path, mapQ, identity, path_modi
paths_freq: path, size, path_modi
qv: asmName, nKmer_uniq_asm, nKmer_intersect, QV, ErrorRate

The first three columns are generated from the Rukki path, and the others are the columns we will fill during gap filling. The gaps column contains each gap along with the flanking nodes, allowing you to easily visualize how the gaps or bubbles look using the flanking nodes in Bandage.

In [14]:
obj.gaps
Out[14]:
gapId name gaps notes fixedPath startMatch endMatch finalGaf done
Loading ITables v2.2.4 from the init_notebook_mode cell... (need help?)

Align ONT Reads onto the graph¶

All ONT datasets used during the Verkko assembly are stored in the Verkko output directory. We use the ONT reads and align them to the Verkko graph using graphAligner. This alignment helps us create reasonable paths through the gaps and bubbles. In future versions of Verkko, this step will be integrated into the assembly process.

vf.tl.graphIdx() will index the assembly.homopolymer-compressed.gfa graph, and vf.tl.graphAlign() uses the ONT reads under the 3-align/split directory. The output GAF file will be saved to the 9-graphAlignment/verkko.graphAlign_allONT.gaf.

In [25]:
vf.tl.graphIdx(obj,threads=1)
In [26]:
vf.tl.graphAlign(obj = obj, threads=50)
In [57]:
obj = vf.pp.readGaf(obj)
Looking for GAF file at: /vf/users/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/9-graphAlignment/verkko.graphAlign_allONT.gaf
Loading ONT alignment GAF file...
GAF file successfully loaded.
In [31]:
obj
Out[31]:
verkkoDir: /data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/
stats: contig, scf_ctg, ref_chr, contig_len, ref_chr_len, hap, chr, completeness
paths: name, path, assignment, gaps
gaps: gapId, name, gaps, notes, fixedPath, startMatch, endMatch, finalGaf, done
gaf: Qname, path, mapQ, identity, path_modi
paths_freq: path, size, path_modi
qv: asmName, nKmer_uniq_asm, nKmer_intersect, QV, ErrorRate
In [58]:
obj.gaf.head(2)
Out[58]:
Qname len path mapQ identity path_modi
Loading ITables v2.2.4 from the init_notebook_mode cell... (need help?)

Searching the nodes in the ONT alingment GAF file.¶

Once you have loaded the ONT alignment into your Verkko Fillet object, you can search using two nodes to view the number of paths that span both nodes using the vf.pp.searchNodes function. The output will have 4 columns:

  1. path: The ONT alignment path.
  2. size: The number of supported reads for the path.
  3. node1: Y if the path spans node1.
  4. node2: Y if the path spans node2.
In [25]:
%%time
node_list_input = ['utig4-317', 'utig4-2181']
vf.pp.searchNodes(obj, node_list_input)
`obj.paths_freq` already exists.
Extracting paths containing nodes: ['utig4-317', 'utig4-2181']
CPU times: user 25.9 ms, sys: 1.99 ms, total: 27.9 ms
Wall time: 27.8 ms
Out[25]:
  path size @utig4-317@ @utig4-2181@
14894 >utig4-317<utig4-2324<utig4-2181 2 Y Y
14892 >utig4-317 13920 Y
6472 <utig4-317 13589 Y
3796 <utig4-2181 779 Y
12405 >utig4-2181 658 Y
6473 <utig4-317>utig4-316 93 Y
3797 <utig4-2181>utig4-2180 90 Y
4223 <utig4-2324<utig4-2181 88 Y
6470 <utig4-316>utig4-317 88 Y
3794 <utig4-2180>utig4-2181 87 Y
12406 >utig4-2181>utig4-2324 86 Y
12849 >utig4-2324<utig4-317 85 Y
14893 >utig4-317<utig4-2324 78 Y
3798 <utig4-2181>utig4-2180<utig4-1681 1 Y
6474 <utig4-317>utig4-316>utig4-319 1 Y
6475 <utig4-317>utig4-316>utig4-320 1 Y
6480 <utig4-319<utig4-316>utig4-317 1 Y

3. Simple Tangles or Bubbles¶

In [33]:
image_path = "/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/data/test_giraffe/fig/unit18.png"
display(Image(filename=image_path,width=500, height = 500))
No description has been provided for this image
In [33]:
%%time
node_list_input = ['utig4-424', 'utig4-1283']
vf.pp.searchNodes(obj, node_list_input)
`obj.paths_freq` already exists.
Extracting paths containing nodes: ['utig4-424', 'utig4-1283']
CPU times: user 10.4 ms, sys: 260 µs, total: 10.7 ms
Wall time: 9.88 ms
Out[33]:
  path size @utig4-424@ @utig4-1283@
829 <utig4-1283<utig4-1281<utig4-424<utig4-421 1 Y Y
827 <utig4-1283 49416 Y
9406 >utig4-1283 49028 Y
4658 <utig4-2456<utig4-1283 89 Y
828 <utig4-1283<utig4-1281 81 Y
9407 >utig4-1283>utig4-2456 76 Y
9402 >utig4-1281>utig4-1283 70 Y
15164 >utig4-421>utig4-424>utig4-1281 66 Y
822 <utig4-1281<utig4-424<utig4-421 62 Y
6763 <utig4-424<utig4-421 7 Y
15163 >utig4-421>utig4-424 2 Y
821 <utig4-1281<utig4-424 1 Y
11578 >utig4-1901<utig4-2456<utig4-1283 1 Y
15173 >utig4-424>utig4-1281 1 Y
In [35]:
vf.pp.highlight_nodes(obj, node = "utig4-1282+")
namepathassignmentgaps
dam_compressed.k31.hapmer_from_utig4-1282utig4-1774-,utig4-1771-,utig4-1773+,utig4-2684+,utig4-2600-,utig4-2596-,utig4-2597+,utig4-2636-,utig4-1227-,utig4-1225-,utig4-878-,utig4-875-,utig4-877+,utig4-2373+,utig4-2374+,utig4-2452-,utig4-692-,utig4-688+,utig4-693+,utig4-1779+,utig4-1781+,utig4-2402+,utig4-1978-,utig4-1974-,utig4-1976+,utig4-2502+,utig4-2018-,utig4-2017+,utig4-2020+,utig4-2676-,utig4-223-,utig4-221+,utig4-225+,utig4-1809+,utig4-1810+,utig4-2721+,utig4-1790-,utig4-1786-,utig4-1787+,utig4-1993-,utig4-1995+,utig4-2426+,utig4-2427+,utig4-2575-,utig4-721-,utig4-720-,utig4-308-,utig4-306+,utig4-309+,utig4-983+,utig4-984+,utig4-2448+,utig4-415-,utig4-413-,utig4-398-,utig4-394-,utig4-395+,utig4-2602-,utig4-1857-,utig4-1855+,utig4-949-,utig4-946-,utig4-947+,utig4-2268+,utig4-2269+,utig4-2642+,utig4-515-,utig4-512-,utig4-513+,utig4-2161+,utig4-1793-,utig4-1791+,utig4-1470-,utig4-1469-,utig4-1251-,utig4-1250+,utig4-1253+,utig4-2573-,utig4-199-,utig4-197+,utig4-200+,utig4-2000+,utig4-2002+,utig4-2403+,utig4-1642-,utig4-1592-,utig4-1589+,utig4-1593+,utig4-1818-,utig4-1819+,utig4-2388+,utig4-2390+,utig4-2490-,utig4-2386-,utig4-2384+,utig4-2257-,utig4-2255-,utig4-474-,utig4-472+,utig4-475+,utig4-2363+,utig4-2327-,utig4-2325+,utig4-2328+,utig4-2650-,utig4-422-,utig4-421+,utig4-425+,utig4-1281+,utig4-1282+,utig4-2456+,utig4-1902-,utig4-1898-,utig4-1899+,utig4-2106-,utig4-2108+,utig4-2305+,utig4-2192-,utig4-2189-,utig4-2190+,utig4-2238+,utig4-1077-,utig4-1074-,utig4-1076+,utig4-2166-,utig4-1101-,utig4-1098-,utig4-1100+,utig4-2751-,utig4-2752+DAM_COMPRESSED.K31.HAPMERNone

In this scenario, we find one supporting ONT alignment spanning the gap and flanking reads. However, the number of supporting reads is too small, so we also check the other haplotype paths by using the vf.pp.highlight_nodes() function with the counterpart nodes on the other haplotype to see if another haplotype uses the other nodes in the bubble. If so, as shown here, we can assign the node utig-424 to the paternal strand to fill the gap.

If you have enough evidence for filling the gaps (or partially filling them), you can use this information to fill the gaps using the vf.pp.fillGaps function. You can easily extract the order and direction of each node using Bandage tools (Output > Specify exact path for copy/save). This information will be stored in the obj.gaps attribute and will be used for generating the edited Rukki path for the Verkko CNS run.

  • Caution: Be careful to maintain the original node direction when filling the gaps.
In [36]:
%%time
obj = vf.pp.fillGaps(obj=obj,
               gapId='gapid_19',
               final_path="utig4-424+, utig4-1281+, utig4-1283+")
final path : >utig4-424>utig4-1281>utig4-1283
Updated gapId gapid_19!
 
✅ The start node and its direction match the original node.
✅ The start node and its direction match the original node.
[=                                                 ] 1/43 gaps filled 
CPU times: user 3.05 ms, sys: 0 ns, total: 3.05 ms
Wall time: 3 ms

Once you edit your path, it would be great to check the obj.gaps to ensure that both the startMatch and the endMatch are set to "match." The vf.pp.fillGaps() function also provides this information when you edit the path. These two pieces of information not only indicate the starting and ending nodes but also account for the direction of the nodes. If you provide reverted nodes, each field will be filled with "notMatch." If this is unintentional, it should be corrected before completing the gap-filling process.

Here, we show how the vf.pp.fillGaps() function reports when we provide different directions for the end nodes.

In [37]:
%%time
obj = vf.pp.fillGaps(obj=obj,
               gapId='gapid_19',
               final_path="utig4-424+, utig4-1281+, utig4-1283-")
final path : >utig4-424>utig4-1281<utig4-1283
Updated gapId gapid_19!
 
✅ The start node and its direction match the original node.
❌ The start node and its direction do not match the original node.
[=                                                 ] 1/43 gaps filled 
CPU times: user 2.81 ms, sys: 0 ns, total: 2.81 ms
Wall time: 2.77 ms

It says that the start node and its direction match, but the end node does not match.

If you want to modify the gap information you have already added, simply use the same function to overwrite it. Additionally, if you want to reset the gap information, set final_path="" in vf.pp.fillGaps.

In [38]:
%%time
obj = vf.pp.fillGaps(obj=obj,
               gapId='gapid_19',
               final_path="")
gapId gapid_19: 'final_path' is empty. Other columns have been reset to 'NA'.
[                                                  ] 0/43 gaps filled 
CPU times: user 2.65 ms, sys: 0 ns, total: 2.65 ms
Wall time: 2.59 ms

4. Edge Missing¶

In [45]:
image_path = "/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/data/test_giraffe/fig/unit17.png"
display(Image(filename=image_path,width=500, height = 500))
No description has been provided for this image

The vf.pp.searchSplit function will find the split reads that are aligned to two nodes that are not connected.

In [103]:
%%time
node_list_input = ["utig4-2329","utig4-2651"]
vf.pp.searchSplit(obj,node_list_input)
CPU times: user 7.19 s, sys: 66.8 ms, total: 7.25 s
Wall time: 7.23 s
Out[103]:
Qname path_modi
Loading ITables v2.2.4 from the init_notebook_mode cell... (need help?)
In [104]:
split_reads = vf.pp.searchSplit(obj,node_list_input)
In [167]:
vf.tl.insertGap(obj, gapid = "gapid_18", split_reads = split_reads)
Extracting reads...
The split reads for gapid_18 was saved to 11.cleaning/missing_edge/gapid_18.missing_edge.ont_list.txt
The fill the edge was done for gapid_18!
The final path looks like:
{'>utig4-2651<gapmanual-1-len--4567-cov-4<utig4-2329', '>utig4-2329>gapmanual-1-len--4567-cov-4<utig4-2651'}

Troubleshooting : only two reads should be included in the subset.gaf. if you give more than two alignment from one reads, it gives you empty patch.gaf.

In [110]:
%%time
obj = vf.pp.fillGaps(obj=obj,
               gapId='gapid_18',
               final_path="utig4-2329+,gapmanual-1-len--4492-cov-2+,utig4-2651-")
final path : >utig4-2329>gapmanual-1-len--4492-cov-2<utig4-2651
Updated gapId gapid_18!
 
✅ The start node and its direction match the original node.
✅ The start node and its direction match the original node.
[=====                                             ] 5/43 gaps filled 
CPU times: user 9.37 ms, sys: 0 ns, total: 9.37 ms
Wall time: 9.28 ms

4. Loops¶

In [43]:
image_path = "/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/data/test_giraffe/fig/unit2.png"
display(Image(filename=image_path,width=500, height = 500))
No description has been provided for this image

This is one of the rDNA clusters in the Giraffe genome. The node utig4-2421 has a loop, and we don't know how many times the node exists in each haplotype strand. Here, we estimate the minimum number of loops for the node on the maternal strand by finding the number of loop nodes in each ONT alignment that spans the haplotype-assigned nodes near the loop node(in this case, utig4-100 and utig4-495). We then fill the gaps with the estimated number of loops, along with the flanking new gaps. This allows us to build an assembly that may not be perfect but is gradually improving. Additionally, multiple loops could enhance the read alignment in future use.

In [44]:
%%time
vf.pp.estLoops(obj, ["utig4-100", "utig4-495"])
CPU times: user 488 ms, sys: 1.33 s, total: 1.81 s
Wall time: 7.42 s

The most frequent number of loops in the ONT path spanning the loop node and the haplotype-assigned nodes is 3 for this gap. So, we fill this gap with 3 copies of the loop node, flanked by new gaps named loop_uncertain_copy. Additionally, for this case, we add a gap at the end of the new path, which could lead to an "unmatch" warning in the report. However, this is intentionally done, so we can ignore the warning here.

In [40]:
obj = vf.pp.fillGaps(obj=obj,
               gapId='gapid_38',
               final_path="utig4-100+, [N5000N:loop_uncertain_copy],utig4-2421+,utig4-2421+,utig4-2421+,[N5000N:loop_uncertain_copy]")
final path : >utig4-100[N5000N:loop_uncertain_copy]>utig4-2421>utig4-2421>utig4-2421[N5000N:loop_uncertain_copy]
Updated gapId gapid_38!
 
✅ The start node and its direction match the original node.
❌ The start node and its direction do not match the original node.
[==                                                ] 2/43 gaps filled 
In [41]:
vf.pp.highlight_nodes(obj, node = "utig4-495")
namepathassignmentgaps
dam_compressed.k31.hapmer_from_utig4-1104utig4-1107-,utig4-1103-,utig4-1104+,utig4-1511-,utig4-1512+,utig4-1562-,utig4-1561+,utig4-1564+,utig4-1967+,utig4-817-,utig4-813-,utig4-815+,utig4-2301+,utig4-2302+,utig4-2391+,utig4-1294-,utig4-1292+,utig4-1296+,utig4-1454+,utig4-1455+,utig4-2461-,utig4-2463+,utig4-2467+,utig4-2526+,utig4-565-,utig4-564+,utig4-568+,utig4-1057-,utig4-662-,utig4-659-,utig4-661+,utig4-1606+,utig4-1608+,utig4-1609+,utig4-2565+,utig4-2412-,utig4-2410-,utig4-1817-,utig4-1815+,utig4-67-,utig4-66+,utig4-70+,[N212780N:tangle],utig4-100+,[N5000N:ambig_path],utig4-2421+,utig4-495-,utig4-494+,utig4-497+,utig4-1094-,utig4-1096+,utig4-2419-,utig4-2214-,utig4-2212+,utig4-2216+,utig4-2217+,utig4-450-,utig4-449+,utig4-452+,utig4-2290-,utig4-1272-,utig4-1271+,utig4-1274+,utig4-2229-,utig4-1878-,utig4-1877+,utig4-1881+,utig4-2480+,utig4-2335-,utig4-2332-,utig4-2334+,utig4-2477-,utig4-2478+,utig4-1431-,utig4-1429-,utig4-1427+,utig4-1064-,utig4-1063+,utig4-1066+,utig4-1782-,utig4-1784+,utig4-2662+,utig4-2594-,utig4-2592+,utig4-1132-,utig4-1130-,[N5000N:ambig_path],utig4-1082+,utig4-1085+,utig4-2082-,utig4-2084+,utig4-2086+,utig4-2168+,utig4-1949-,utig4-1948+,utig4-1813-,utig4-1812-,utig4-1551-,utig4-1549-,utig4-1268-,utig4-1266+,utig4-1270+,utig4-2441+,utig4-645-,utig4-641-,utig4-642+,utig4-1408+,utig4-1410+,utig4-1924-,utig4-1925+,utig4-2246-,utig4-686-,utig4-683-,utig4-685+,utig4-2099-,utig4-2101+,utig4-2618-,utig4-2245-,utig4-2243-,utig4-1120-,utig4-1116-,utig4-1117+,utig4-1453+,utig4-968-,utig4-967-,utig4-596-,utig4-592-,utig4-593+,utig4-1041+,utig4-368-,utig4-367+,utig4-370+DAM_COMPRESSED.K31.HAPMERNone

6.Homologous nodes but hint from other haplotype¶

In [47]:
image_path = "/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/data/test_giraffe/fig/gapid_24.png"
display(Image(filename=image_path,width=500, height = 500))
No description has been provided for this image
In [48]:
%%time
node_list_input = ['utig4-282', 'utig4-2090']
vf.pp.searchNodes(obj, node_list_input)
`obj.paths_freq` already exists.
Extracting paths containing nodes: ['utig4-282', 'utig4-2090']
CPU times: user 7.98 ms, sys: 1 ms, total: 8.98 ms
Wall time: 8.33 ms
Out[48]:
  path size @utig4-282@ @utig4-2090@
6386 <utig4-282>utig4-283>utig4-2090 90 Y Y
3498 <utig4-2090<utig4-283>utig4-282 88 Y Y
3500 <utig4-2090<utig4-284>utig4-282 85 Y Y
6388 <utig4-282>utig4-284>utig4-2090 82 Y Y
3496 <utig4-2090 1792 Y
12124 >utig4-2090 1773 Y
14794 >utig4-282 559 Y
6384 <utig4-282 543 Y
6392 <utig4-285<utig4-282 101 Y
6394 <utig4-286<utig4-282 94 Y
14796 >utig4-282>utig4-286 92 Y
7877 <utig4-808>utig4-809<utig4-2090 90 Y
14795 >utig4-282>utig4-285 86 Y
12126 >utig4-2090<utig4-809>utig4-808 76 Y
7879 <utig4-808>utig4-810<utig4-2090 74 Y
12128 >utig4-2090<utig4-810>utig4-808 72 Y
3497 <utig4-2090<utig4-283 9 Y
16352 >utig4-809<utig4-2090 9 Y
6389 <utig4-283>utig4-282 8 Y
12125 >utig4-2090<utig4-809 6 Y
14797 >utig4-283>utig4-2090 6 Y
14798 >utig4-284>utig4-2090 6 Y
6385 <utig4-282>utig4-283 5 Y
12127 >utig4-2090<utig4-810 5 Y
16390 >utig4-810<utig4-2090 5 Y
6387 <utig4-282>utig4-284 3 Y
6390 <utig4-284>utig4-282 3 Y
3499 <utig4-2090<utig4-284 1 Y

Here, we can identify two clear paths: <utig4-282>utig4-283>utig4-2090 and <utig4-282>utig4-284>utig4-2090. However, we are unable to determine the haplotype using ONT alignment. In such cases, one of the obvious paths might already be used by another haplotype. Thus, we can assign the haplotype harboring this gap to the other obvious path, using an elimination method.

In [49]:
vf.pp.highlight_nodes(obj, node = "utig4-282-")
namepathassignmentgaps
sire_compressed.k31.hapmer_from_utig4-457utig4-2548+,utig4-2549+,utig4-2661-,utig4-916-,utig4-913-,utig4-915+,utig4-1121-,utig4-1122+,utig4-2641-,utig4-1380-,utig4-1379+,utig4-1382+,utig4-2677-,utig4-2585-,utig4-2584-,utig4-2483-,utig4-2482-,utig4-2026-,utig4-2023-,utig4-2025+,utig4-2601+,utig4-285-,utig4-282-,[N5000N:ambig_path],utig4-2090+,utig4-809-,utig4-808+,utig4-812+,utig4-2691+,utig4-355-,utig4-353+,utig4-357+,utig4-1350-,utig4-1351+,[N5000N:ambig_path],utig4-6+,utig4-14+,utig4-1317+,utig4-1320+,utig4-2617+,utig4-867-,utig4-865-,utig4-866+,utig4-869+,utig4-1097+,utig4-840-,utig4-839+,utig4-53-,utig4-51+,utig4-54+,utig4-1979-,utig4-456-,utig4-454+,utig4-457+,utig4-1300-,utig4-1302+,utig4-2170+,utig4-1373-,utig4-1371+,utig4-1375+,utig4-2262-,utig4-1946-,utig4-1943-,utig4-1945+,utig4-2512-,utig4-2514+,utig4-40-,[N5000N:ambig_path],utig4-39+,[N1000N:ambig_bubble],utig4-119-,utig4-120+,utig4-1010-,utig4-1012+,utig4-1263-,utig4-1265+,utig4-2622+,utig4-2473-,utig4-2471+,utig4-1448-,utig4-1447+,utig4-1451+,utig4-2186-,utig4-2187+,utig4-1969-,utig4-1968-,utig4-470-,utig4-467-,utig4-469+,utig4-887+,utig4-888+,utig4-2438-,utig4-2440+,utig4-2469+,utig4-2199-,utig4-2198+,utig4-2202+,utig4-2251-,utig4-1669-,utig4-1667+,utig4-1671+,utig4-1868+,utig4-803-,utig4-800-,utig4-802+,utig4-885+,utig4-511-,utig4-509+,utig4-765-,utig4-766+,utig4-2122-,utig4-2124+,utig4-2366+,utig4-352-,utig4-348-,utig4-350+,utig4-2621+,utig4-1188-,utig4-1186+,utig4-1190+,utig4-1191+,utig4-2078-,utig4-2077-,utig4-1141-,utig4-1139+,utig4-1092-,utig4-1089-,utig4-1091+,utig4-1389+,utig4-1391+,utig4-2008-,utig4-28-,utig4-27+,utig4-31+,utig4-823+,utig4-825+SIRE_COMPRESSED.K31.HAPMERNone
dam_compressed.k31.hapmer_from_utig4-458utig4-2548+,utig4-2550+,utig4-2661-,utig4-917-,utig4-913-,utig4-914+,utig4-1121-,utig4-1123+,utig4-2641-,utig4-1381-,utig4-1379+,utig4-1383+,utig4-2677-,utig4-2586-,utig4-2584-,utig4-2484-,utig4-2482-,utig4-2027-,utig4-2023-,utig4-2024+,utig4-2601+,utig4-286-,utig4-282-,utig4-283+,utig4-2090+,utig4-810-,utig4-808+,utig4-811+,utig4-2691+,utig4-354-,utig4-353+,utig4-356+,utig4-1350-,utig4-1318-,utig4-1317+,utig4-1319+,utig4-2617+,utig4-868-,utig4-865-,utig4-289-,utig4-287+,utig4-1097+,utig4-841-,utig4-839+,utig4-52-,utig4-51+,utig4-55+,utig4-1979-,utig4-455-,utig4-454+,utig4-458+,utig4-1300-,utig4-1301+,utig4-2170+,utig4-1372-,utig4-1371+,utig4-1374+,utig4-2262-,utig4-1947-,utig4-1943-,utig4-1944+,utig4-2512-,utig4-2513+,utig4-2515+,utig4-2612-,utig4-2610-,utig4-2608+,[N5000N:ambig_path],utig4-2527-,utig4-1013-,utig4-1010-,utig4-1011+,utig4-1263-,utig4-1264+,utig4-2622+,utig4-2472-,utig4-2471+,utig4-1449-,utig4-1447+,utig4-1450+,utig4-2186-,utig4-1972-,utig4-1971-,utig4-1968-,utig4-471-,utig4-467-,utig4-468+,utig4-887+,utig4-889+,utig4-2438-,utig4-2439+,utig4-2469+,utig4-2199-,utig4-2198+,utig4-2203+,utig4-2251-,utig4-1668-,utig4-1667+,utig4-1670+,utig4-1868+,utig4-804-,utig4-800-,utig4-801+,utig4-885+,utig4-886+,utig4-768-,utig4-765-,utig4-767+,utig4-2122-,utig4-2123+,utig4-2366+,utig4-351-,utig4-348-,utig4-349+,utig4-2621+,utig4-1187-,utig4-1186+,utig4-1189+,utig4-2077-,utig4-1140-,utig4-1139+,utig4-1093-,utig4-1089-,utig4-1090+,utig4-1389+,utig4-1390+,utig4-2008-,utig4-29-,utig4-27+,utig4-30+,utig4-823+,utig4-824+DAM_COMPRESSED.K31.HAPMERNone

When we search for the utig4-282- node in the paths, the maternal (dam) strand has already used the utig4-283+ node. Therefore, we can use the utig4-284+ node to fill the gap on the paternal (sire) strand.

In [43]:
obj = vf.pp.fillGaps(obj=obj,
               gapId='gapid_24',
               final_path="utig4-282-, utig4-284+, utig4-2090+")
final path : <utig4-282>utig4-284>utig4-2090
Updated gapId gapid_24!
 
✅ The start node and its direction match the original node.
✅ The start node and its direction match the original node.
[===                                               ] 3/43 gaps filled 

7. Homologous without hint¶

In [51]:
image_path = "/vf/users/Phillippy/projects/giraffeT2T/assembly/script/post_verkko_pkg/data/test_giraffe/fig/gapid_10.png"
display(Image(filename=image_path,width=500, height = 500))
No description has been provided for this image
In [26]:
%%time
node_list_input = ['utig4-2613', 'utig4-626']
vf.pp.searchNodes(obj, node_list_input)
`obj.paths_freq` already exists.
Extracting paths containing nodes: ['utig4-2613', 'utig4-626']
CPU times: user 24.8 ms, sys: 1e+03 µs, total: 25.8 ms
Wall time: 25.5 ms
Out[26]:
  path size @utig4-2613@ @utig4-626@
15775 >utig4-626>utig4-629<utig4-2613 108 Y Y
13876 >utig4-2613<utig4-629<utig4-626 101 Y Y
13878 >utig4-2613<utig4-630<utig4-626 57 Y Y
15777 >utig4-626>utig4-630<utig4-2613 53 Y Y
7300 <utig4-626 5187 Y
15773 >utig4-626 5173 Y
13874 >utig4-2613 2813 Y
5199 <utig4-2613 2752 Y
15728 >utig4-612>utig4-616>utig4-2613 112 Y
5201 <utig4-2613<utig4-615<utig4-612 86 Y
5203 <utig4-2613<utig4-616<utig4-612 83 Y
15726 >utig4-612>utig4-615>utig4-2613 76 Y
7305 <utig4-626>utig4-628<utig4-2521 68 Y
7302 <utig4-626>utig4-627<utig4-2521 62 Y
13536 >utig4-2521<utig4-628>utig4-626 54 Y
13534 >utig4-2521<utig4-627>utig4-626 49 Y
7311 <utig4-628>utig4-626 38 Y
7309 <utig4-627>utig4-626 31 Y
7301 <utig4-626>utig4-627 30 Y
15776 >utig4-626>utig4-630 30 Y
7304 <utig4-626>utig4-628 26 Y
7315 <utig4-630<utig4-626 26 Y
7306 <utig4-626>utig4-628<utig4-2521<utig4-532 23 Y
7303 <utig4-626>utig4-627<utig4-2521<utig4-532 20 Y
15790 >utig4-630<utig4-2613 20 Y
13877 >utig4-2613<utig4-630 14 Y
7312 <utig4-629<utig4-626 6 Y
5202 <utig4-2613<utig4-616 5 Y
13875 >utig4-2613<utig4-629 5 Y
15731 >utig4-615>utig4-2613 5 Y
15788 >utig4-629<utig4-2613 5 Y
15732 >utig4-616>utig4-2613 4 Y
5200 <utig4-2613<utig4-615 3 Y
15513 >utig4-532>utig4-2521<utig4-627>utig4-626 2 Y
15774 >utig4-626>utig4-629 2 Y
7307 <utig4-626>utig4-628<utig4-2521<utig4-534 1 Y
14849 >utig4-302>utig4-532>utig4-2521<utig4-628>utig4-626 1 Y
14856 >utig4-303>utig4-534>utig4-2521<utig4-627>utig4-626 1 Y
15518 >utig4-534>utig4-2521<utig4-628>utig4-626 1 Y
In [53]:
vf.pp.highlight_nodes(obj, node = "utig4-2613+")
namepathassignmentgaps
dam_compressed.k31.hapmer_from_utig4-1425utig4-857-,utig4-854-,utig4-851-,utig4-853+,utig4-2710+,utig4-2454-,utig4-2453-,utig4-59-,utig4-58+,utig4-62+,utig4-2589-,utig4-1175-,utig4-1173-,utig4-613-,utig4-612+,utig4-616+,utig4-2613+,[N1000N:ambig_bubble],utig4-626-,utig4-627+,utig4-2521-,utig4-534-,utig4-303-,utig4-301+,utig4-305+,utig4-669-,utig4-671+,utig4-928-,utig4-929+,utig4-2675+,utig4-1922-,utig4-1920+,utig4-73-,utig4-71+,utig4-74+,utig4-1424-,utig4-1425+,utig4-1845-,utig4-1847+,utig4-2685-,utig4-1587-,utig4-1584-,utig4-1585+,utig4-2457-,utig4-2050-,utig4-2049+,utig4-2052+,utig4-2291-,utig4-1197-,utig4-1193-,utig4-1195+,utig4-2194+,utig4-1865-,utig4-1863+,utig4-1867+,utig4-2382-,utig4-589-,utig4-587+,utig4-591+,utig4-1247+,utig4-1248+,utig4-2016+,utig4-1357-,utig4-1355+,utig4-1358+,utig4-2383-,utig4-2241-,utig4-2239-,utig4-1420-,utig4-1416-,utig4-1417+,utig4-1762+,utig4-1764+,utig4-2554-,utig4-2553-,utig4-2551+,utig4-1767-,utig4-1765+,utig4-1769+,utig4-2487-,utig4-963-,utig4-962+,utig4-966+,utig4-2357-,utig4-902-,utig4-898-,utig4-899+,utig4-1675-,utig4-1677+,utig4-2225+,utig4-1653-,utig4-1651+,utig4-111-,utig4-112+,utig4-2250+,utig4-1259-,utig4-1257+,utig4-1260+,utig4-2540-,utig4-2160-,utig4-2158+,utig4-2092-,utig4-2091+,utig4-2094+,utig4-2494-,utig4-2010-,utig4-2009-,utig4-1723-,utig4-1725+,utig4-1727+,utig4-2420+,utig4-1639-,utig4-1636-,utig4-1637+,utig4-2620-,utig4-1953-,utig4-1951+,utig4-1954+,utig4-2188+,utig4-1402-,utig4-1400-DAM_COMPRESSED.K31.HAPMERNone
sire_compressed.k31.hapmer_from_utig4-1426utig4-996-,[N34856N:tangle],utig4-856-,utig4-854-,utig4-851-,utig4-852+,utig4-2710+,utig4-2455-,utig4-2453-,utig4-60-,utig4-58+,utig4-61+,utig4-2589-,utig4-1174-,utig4-1173-,utig4-614-,utig4-612+,utig4-615+,utig4-2613+,[N1000N:ambig_bubble],utig4-626-,utig4-628+,utig4-2521-,utig4-532-,utig4-302-,utig4-301+,utig4-304+,utig4-669-,utig4-670+,utig4-928-,utig4-930+,utig4-2675+,utig4-1921-,utig4-1920+,utig4-72-,utig4-71+,utig4-75+,utig4-1424-,utig4-1426+,utig4-1845-,utig4-1846+,utig4-2685-,utig4-1588-,utig4-1584-,utig4-1586+,utig4-2457-,utig4-2051-,utig4-2049+,utig4-2053+,utig4-2291-,utig4-1196-,utig4-1193-,utig4-1194+,utig4-2194+,utig4-1864-,utig4-1863+,utig4-1866+,utig4-2382-,utig4-588-,utig4-587+,utig4-590+,utig4-1247+,utig4-1249+,utig4-2016+,utig4-1356-,utig4-1355+,utig4-1359+,utig4-2383-,utig4-2240-,utig4-2239-,utig4-1419-,utig4-1416-,utig4-1418+,[N5000N:ambig_path],utig4-1763+,utig4-2554-,utig4-2552-,utig4-2551+,utig4-1766-,utig4-1765+,utig4-1768+,utig4-2487-,utig4-964-,utig4-962+,utig4-965+,utig4-2357-,utig4-901-,utig4-898-,utig4-900+,utig4-1675-,utig4-1676+,utig4-2225+,utig4-1652-,utig4-1651+,utig4-1127-,utig4-1128+,utig4-2250+,utig4-1258-,utig4-1257+,utig4-1261+,utig4-2540-,utig4-2159-,utig4-2158+,utig4-2093-,utig4-2091+,utig4-2095+,utig4-2494-,utig4-2011-,utig4-2009-,utig4-1728-,utig4-1725+,utig4-1726+,[N5000N:ambig_path],utig4-1640-,utig4-1636-,utig4-1638+,utig4-2620-,utig4-1952-,utig4-1951+,utig4-1955+,utig4-2188+,utig4-1401-,utig4-1400-SIRE_COMPRESSED.K31.HAPMERNone

This scenario has the same haplotype unassignment issue as above, but there is no hint from the other haplotype. Therefore, we randomly assign the haplotype to each gap using the paths obtained from vf.pp.searchNodes(obj, node_list_input). This haplotype-switching error can be corrected during the polishing step later.

In [44]:
obj = vf.pp.fillGaps(obj=obj,
               gapId='gapid_10',
               final_path="utig4-2613+, utig4-630-, utig4-626-")
final path : >utig4-2613<utig4-630<utig4-626
Updated gapId gapid_10!
 
✅ The start node and its direction match the original node.
✅ The start node and its direction match the original node.
[====                                              ] 4/43 gaps filled 
In [45]:
obj = vf.pp.fillGaps(obj=obj,
               gapId='gapid_12',
               final_path="utig4-2613+, utig4-629-, utig4-626-")
final path : >utig4-2613<utig4-629<utig4-626
Updated gapId gapid_12!
 
✅ The start node and its direction match the original node.
✅ The start node and its direction match the original node.
[=====                                             ] 5/43 gaps filled 

Finalize paths¶

In [13]:
vf.pp.checkGapFilling(obj)
[=====                                             ] 5/43 gaps filled 
Out[13]:
gapId name gaps notes fixedPath startMatch endMatch finalGaf done
Loading ITables v2.2.4 from the init_notebook_mode cell... (need help?)

The final Rukki path file to be used in the Verkko CNS run can be generated using the fixed gap file in the obj and written to the local directory for future use.

In [47]:
vf.pp.writeFixedGaf(obj)
Reading original rukki path from 8-hicPipeline/rukki.paths.gaf
Writing fixed rukki path to final_rukki_fixed.paths.gaf

Save fillet obj¶

The Verkko-fillet object can be saved as a Python pickle object using the vf.pp.save_Verkko() function. This object will be written to the Verkko output directory.

In [49]:
%%time

fileName = "verkko-fillet.pkl"
vf.pp.save_Verkko(obj, fileName)
save verkko fllet obj to -> verkko-fillet.pkl
CPU times: user 9.31 s, sys: 1.45 s, total: 10.8 s
Wall time: 11.6 s
In [64]:
session_info.show()
Out[64]:
Click to view session information
-----
itables             2.2.3
pandas              2.2.3
session_info        1.0.0
verkkoFillet        NA
-----
Click to view modules imported as dependencies
PIL                         10.0.0
anyio                       NA
arrow                       1.2.3
asttokens                   NA
attr                        23.1.0
attrs                       23.1.0
babel                       2.12.1
certifi                     2023.07.22
cffi                        1.15.1
charset_normalizer          3.2.0
comm                        0.1.4
cycler                      0.10.0
cython_runtime              NA
dateutil                    2.9.0.post0
debugpy                     1.6.7.post1
decorator                   5.1.1
defusedxml                  0.7.1
exceptiongroup              1.2.2
executing                   2.1.0
fastjsonschema              NA
fqdn                        NA
idna                        3.4
importlib_metadata          NA
importlib_resources         NA
ipykernel                   6.25.1
ipywidgets                  8.1.0
isoduration                 NA
jedi                        0.19.2
jinja2                      3.1.2
json5                       NA
jsonpointer                 2.4
jsonschema                  4.19.0
jsonschema_specifications   NA
jupyter_events              0.7.0
jupyter_server              2.7.2
jupyterlab_server           2.24.0
kaleido                     0.2.1
kiwisolver                  1.4.4
markupsafe                  2.1.3
matplotlib                  3.9.3
matplotlib_inline           0.1.7
mpl_toolkits                NA
natsort                     8.4.0
nbformat                    5.9.2
numpy                       1.26.0
overrides                   NA
packaging                   23.1
parso                       0.8.4
patsy                       0.5.4
pexpect                     4.9.0
pickleshare                 0.7.5
pkg_resources               NA
platformdirs                3.10.0
plotly                      5.16.1
prometheus_client           NA
prompt_toolkit              3.0.48
psutil                      5.9.5
ptyprocess                  0.7.0
pure_eval                   0.2.3
pyarrow                     17.0.0
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.9.5
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pygments                    2.18.0
pyparsing                   3.0.9
pythonjsonlogger            NA
pytz                        2024.1
referencing                 NA
requests                    2.31.0
rfc3339_validator           0.1.4
rfc3986_validator           0.1.1
rpds                        NA
scipy                       1.11.4
seaborn                     0.12.2
send2trash                  NA
six                         1.16.0
sniffio                     1.3.0
stack_data                  0.6.2
statsmodels                 0.14.1
tenacity                    NA
tornado                     6.3.3
tqdm                        4.66.1
traitlets                   5.14.3
typing_extensions           NA
uri_template                NA
urllib3                     2.0.4
wcwidth                     0.2.13
webcolors                   1.13
websocket                   1.6.1
yaml                        6.0.1
zipp                        NA
zmq                         25.1.1
zoneinfo                    NA
-----
IPython             8.18.1
jupyter_client      8.3.0
jupyter_core        5.3.1
jupyterlab          4.0.5
notebook            7.0.2
-----
Python 3.9.17 | packaged by conda-forge | (main, Aug 10 2023, 07:02:31) [GCC 12.3.0]
Linux-4.18.0-425.19.2.el8_7.x86_64-x86_64-with-glibc2.28
-----
Session information updated at 2024-12-11 00:45

Once you save the Verkko-Fillet object locally, you can load all datasets by loading the object itself, without needing to read them individually next time.

In [5]:
%%time

fileName = "verkko-fillet.pkl"
obj = vf.pp.load_Verkko(fileName)
load verkko fllet obj from <- verkko-fillet.pkl
CPU times: user 18min 41s, sys: 7.15 s, total: 18min 48s
Wall time: 18min 52s

Make new verkko directory for running verkko consensus¶

To run the Verkko consensus steps using the manual gap-filling path file, the first step is to generate the directory structure required for Verkko to start from the consensus step. The vf.pp.mkCNSdir function helps create this structure in the specified location by copying or creating symbolic links from the Verkko output directory.

In [175]:
%%time 

new_folder_path = "/data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-cns-test"
vf.pp.mkCNSdir(obj, new_folder_path)
Symbolic links and files created from /data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-thic_legacy/ to /data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-cns-test
 
/data/Phillippy/projects/giraffeT2T/assembly/verkko2.2_hifi-duplex_trio-hic/verkko-cns-test
├── 1-buildGraph
├── 2-processGraph
├── 3-align
├── 3-alignTips
├── 4-processONT
├── 5-untip
├── 6-layoutContigs
├── 6-rukki
├── 7-consensus
└── hifi-corrected.fasta.gz
CPU times: user 0 ns, sys: 7.88 s, total: 7.88 s
Wall time: 19.5 s

Update the new Verkko directory to include ONT support for filling the edge at blunt missing edge gaps, if your assembly has this issue.

In [ ]:
%%time

vf.pp.updateCNSdir_missingEdges(obj, new_folder_path)
In [ ]:
 
In [ ]: